5 Probability

The usual touchstone of whether what someone asserts is mere persuasion or at least a subjective conviction, i.e., firm belief, is betting. Often someone pronounces his propositions with such confident and inflexible defiance that he seems to have entirely laid aside all concern for error. A bet disconcerts him. Sometimes he reveals that he is persuaded enough for one ducat but not for ten. For he would happily bet one, but at ten he suddenly becomes aware of what he had not previously noticed, namely that it is quite possible that he has erred. -— Immanuel Kant, Critique of Pure Reason

The central tension, and opportunity, in data science is the interplay between the data and the science, between our empirical observations and the models which we use to understand them. Probability is the language we use to explore that interplay; it connects models to data, and data to models.

The only package students need in this chapter is tidyverse.

5.1 Probability distributions

Dice and Probability.

FIGURE 5.1: Dice and Probability.

What does it mean that Trump had a 30% chance of winning re-election in the fall of 2020? That there is a 90% probability of rain today? That the dice at the casino are unfair?

Probability is about quantifying uncertainty. We can think of probability as a proportion. The probability of an event occurring is a number from 0 to 1, where 0 means that the event is impossible and 1 means that the event is 100% certain.

Let’s begin with the simplest events: coin flips and dice rolls. If the dice and the coins are fair, we can operate under the assumption that all outcomes are equally likely.

This allows us to make the following statements:

  • The probability of rolling a 1 or a 2 is 2/6, or 1/3.
  • The probability of rolling a 1, 2, 3, 4, 5, or 6 is 1.
  • The probability of flipping a coin and getting tails is 1/2.

For the purposes of this Primer, a probability distribution is a mathematical object that covers a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 inclusive. The chances must sum to 1. The set of possible outcomes — heads or tails for the coin, 1 through 6 for a single die, 2 through 12 for a pair of dice — can be either discrete or continuous. This set of outcomes is the domain of the probability distribution. There are three types of probability distributions: mathematical, empirical, and posterior.

The key difference between a distribution, as we have explored them Section 2.8, and a probability distribution is the requirement that the sum of the probabilities of the individual outcomes must be exactly 1. There is no such requirement in a distribution. But any distribution can be turned into a probability distribution by “normalizing” it, as we will explore. In this context, we will often refer to a distribution which is not (yet) a probability distribution as an “unnormalized” distribution.

Pay attention to notation. Whenever we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho” but spoken aloud as “p” by us) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. However, when we are referring to the entire probability distribution over a set of outcomes, we will use \(\text{Prob}()\). For example, the probability distribution of a coin toss is \(\text{Prob}(\text{coin})\). That is, \(\text{Prob}(\text{coin})\) is composed of the two specific probabilities (50% and 50%) mapped from the two values in the domain (Heads and Tails). Similarly, \(\text{Prob}(\text{sum of two dice})\) is the probability distribution over the set of 11 outcomes (2 through 12) which are possible when you take the sum of two dice. \(\text{Prob}(\text{sum of two dice})\) is made up of 11 numbers — \(\rho_2\), \(\rho_3\), …, \(\rho_{12}\) — each representing the unknown probability that the sum will equal their value. That is, \(\rho_2\) is the probability of rolling a 2.

5.1.1 Flipping a coin

A mathematical distribution is based on a mathematical formula. Assuming that the coin is perfectly fair, we should, on average, get heads as often as we get tails.

An empirical distribution is based on data. You can think of this as the probability distribution created by running a simulation. In theory, if we increase the number of coins we flip in our simulation, the empirical distribution will look more and more similar to the mathematical distribution. The probability distribution is the Platonic form. The empirical distribution will often look like the mathematical probability distribution, but it will rarely be exactly the same.

In this simulation, there are 56 heads and 44 tails. The outcome will vary every time we run the simulation, but the proportion of heads to tails should not be too different if this coin is fair.

# We are flipping one fair coin a hundreds times. We need to get the same result
# each time we create this graphic because we want the results to match the
# description in the text. Using set.seed() guarantees that the random results
# are the same each time. We define 0 as tails and 1 as heads.

set.seed(3)

tibble(results = sample(c(0, 1), 100, replace = TRUE)) %>% 
  ggplot(aes(x = results)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 0.5, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Flipping one coin a hundred times",
         x = "Outcome\nResult of Coin Flip",
         y = "Probability") +
    scale_x_continuous(breaks = c(0, 1), 
                       labels = c("Heads", "Tails")) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

A posterior distribution is based on beliefs and expectations. It displays your belief about things you can’t see right now. You may have posterior distributions for outcomes in the past, present, or future.

In the case of the coin toss, the posterior distribution changes depending on your beliefs. For instance, let’s say your friend brought a coin to school and asked to bet you. If the result is heads, you have to pay them $5.

This makes you suspicious. Your posterior distribution would reflect this. You might believe that \(\rho_h\) is 0.95 and \(\rho_t\) is 0.05.

The full terminology is mathematical (or empirical or posterior) probability distribution. But we will often shorten this to just mathematical (or empirical or posterior) distribution. The word “probability” is understood, even if it is not present.

5.1.2 Rolling two dice

Our mathematical distribution tells us that, with a fair dice, the probability of getting 1, 2, 3, 4, 5, and 6 are equal: there is a 1/6 chance of each. When we roll two dice at the same time and sum the numbers, the values closest to the middle are more common than values at the edge because there are more combinations of numbers that add up to the middle values.

We get an empirical distribution by rolling two dice a hundred times, either by hand or with a computer simulation. The result is not identical to the mathematical distribution because of the inherent randomness of the real world and/or of simulation.

# In the coin example, we create the vector ahead of time, and then assigned
# that vector to a tibble. There was nothing wrong with that approach. And we
# could do the same thing here. But the use of map_* functions is more powerful
# and will be necessary in later chapters. Still need set.seed(), as we almost
# always will when creating random objects.

set.seed(1)

emp_dist_dice <- tibble(ID = 1:100) %>% 
  mutate(die_1 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(die_2 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>% 
  mutate(sum = die_1 + die_2) %>% 
  ggplot(aes(x = sum)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 1, 
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Sum from rolling two dice, replicated one hundred times",
         x = "Outcome\nSum of Two Die",
         y = "Probability") +
    scale_x_continuous(breaks = seq(2, 12, 1), labels = 2:12) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

emp_dist_dice

We might consider labeling the y-axis in plots of empirical distributions as “Proportion” rather than “Probability” since it is an actual proportion, calculated from real (or simulated) data. We will keep it as “Probability” since we want to emphasize the parallels between mathematical, empirical and posterior probability distributions.

The posterior distribution for rolling two dice a hundred times depends on your beliefs. If you take the dice from your Monopoly set, you have reason to believe that the assumptions underlying the mathematical distribution are true. However, if you walk into a crooked casino and a host asks you to play craps, you might be suspicious. In craps, a come-out roll of 7 and 11 is a “natural,” resulting in a win for the “shooter” and a loss for the casino. You might expect those numbers to occur less often than they would with fair dice. Meanwhile, a come-out roll of 2, 3 or 12 is a loss for the shooter. You might also expect values like 2, 3 and 12 to occur more frequently. Your posterior distribution might look like this:

Someone less suspicious of the casino would have a posterior distribution which looks more like the mathematical distribution.

5.1.3 Presidential elections

Now let’s say we are building probability distributions for political events, like a presidential election. We want to know the probability that Democratic candidate wins X electoral votes, where X comes from the range of possible outcomes: 0 to 538. (The total number of electoral votes in US elections since 1964 is 538.)

We can start with a mathematical distribution for X which assumes that the chances of the Democratic candidate winning any given state’s electoral votes is 0.5 and the results from each state are independent.

We know that campaign platforms, donations, charisma, and many other factors will contribute to a candidate’s success. Elections are more complicated than coin tosses. We also know that many presidential elections in history have resulted in much bigger victories or defeats than this distribution seems to allow for.

The empirical distribution in this case could involve looking into past elections in the United States and counting the number of electoral votes that the Democrats won in each. For the empirical distribution, we create a tibble with electoral vote results from past elections. Looking at elections since 1964, we can observe that the number of electoral votes that the Democrats received in each one is different. Given that we only have 15 entries, it is difficult to draw conclusions or make predictions based off of this empirical distribution.

However, this model is enough to suggest that the assumptions of the mathematical probability distribution above do not work for electoral votes. The model assumes that the Democrats have a 50% chance of receiving each of the 538 votes. Just looking at the mathematical probability distribution, we can observe that receiving 13 or 17 or 486 votes out of 538 would be extreme and almost impossible under this mathematical model. However, our empirical distribution tells us that those were real election results.

The posterior distribution of electoral votes is a popular topic, and an area of strong disagreement, among data scientists. Consider this posterior from FiveThirtyEight.

Here is a posterior from the FiveThirtyEight website from August 13, 2020. This was created using the same data as the above distribution, but simply displayed differently. For each electoral result, the height of the bar represents the probability that a given event will occur. However, there are no lablels y-axis telling us what the specific probability of each outcome is. And that is OK! The specific values are not that useful. If we removed the labels on our y-axes, would it matter?

Here is the posterior from The Economist, also from August 13, 2020. This looks confusing at first because they chose to merge the axes for Republican and Democratic electoral votes. We can tell that The Economist was less optimistic, relative to FiveThirtyEight, about Trump’s chances in the election.

These two models, built by smart people using similar data sources, have reached fairly different conclusions. Data science is difficult! There is not one “right” answer. Real life is not a problem set.

Watch the makers of these two models throw shade at each other on Twitter! Eliot Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don't seem to be too impressed with each other's work! More smack talk [here](https://statmodeling.stat.columbia.edu/2020/08/31/more-on-that-fivethirtyeight-prediction-that-biden-might-only-get-42-of-the-vote-in-florida/) and [here](https://statmodeling.stat.columbia.edu/2020/08/31/problem-of-the-between-state-correlations-in-the-fivethirtyeight-election-forecast/).

FIGURE 5.2: Watch the makers of these two models throw shade at each other on Twitter! Eliot Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don’t seem to be too impressed with each other’s work! More smack talk here and here.

There are many political science questions you could explore with posterior distributions. They can relate to the past, present, or future.

  • Past: How many electoral votes would Hilary Clinton have won if she had picked a different VP?
  • Present: What are the total campaign donations from Harvard faculty?
  • Future: How many electoral votes will the Democratic candidate for president win in 2024?

5.1.4 Continuous distributions

The three examples above are all discrete probability distributions, meaning that the outcome variable can only take on a limited set of values. A coin flip has two outcomes. The sum of a pair of dice has 11 outcomes. The total electoral votes for the Democratic candidate has 539 possible outcomes. In the limit, we can also create continuous probability distributions which have an infinite number of possible outcomes. For example, the average height for an American male could be any real number between 0 inches and 100 inches. (Of course, an average value anywhere near 0 or 100 is absurd. The point is that the average could be 68.564, 68.5643, 68.56432 68.564327, or any real number.)

All the characteristics for discrete probability distributions which we reviewed above apply just as much to continuous probability distributions. For example, we can create mathematical, empirical and posterior probability distributions for continuous outcomes just as we did for discrete outcomes. For example, a posterior probability distribution for \(p\), the percentage of US citizens who approve of Biden, might look like:

Comments:

  • The truth is out there. If we asked all 300+ million Americans whether or not they approve of President Biden, we could know \(p\) exactly. Alas, we can’t do that. We use a posterior probability distribution to summarize are beliefs about the true value of \(p\), a truth we can never confirm.

  • Continuous variables are a myth. Nothing that can be represented on a computer is truly continuous. Even something which appears continuous, like \(p\), actually can only take on a (very large) set of discrete variables. In this case, there are approximately 300 million possible true values of \(p\), one for each total number of people who approve of President Biden.

  • The math of continuous probability distributions can be tricky. Read a book on mathematical probability for all the messy details. Little of that matters in applied work.

  • The most important difference is that, with discrete distributions, it makes sense to estimate the probability of a specific outcome. What is the probability of rolling a 9? With continuous distributions, this makes no sense because there are an infinite number of possible outcomes. With continuous variables, we only estimate intervals. What is the probability that Biden’s true approval percentage is between 50% and 55%?

Don’t worry about the distinctions between discrete and continuous outcomes, or between the discrete and continuous probability distributions which we will use to summarize our beliefs about those outcomes. The basic intuition is the same in both cases.

5.1.5 Working with probability distributions

Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement "PROBABILITY DOES NOT EXIST." This is because probability only exists subjectively in our minds.

FIGURE 5.3: Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement “PROBABILITY DOES NOT EXIST.” This is because probability only exists subjectively in our minds.

A probability distribution is not always easy to work with. It is a complex object. And, in many contexts, we don’t really care about all that complexity. So, instead of providing the full probability distribution, we often just use a summary measure, a number or two or three which captures those aspects of the entire distribution which are relevant to the matter at hand. Let’s explore these issues using the 538 posterior probability distribution, as of August 13, 2020, for the number of electoral votes which will be won by Joe Biden. Here is a tibble with 1,000,000 draws from that distribution:

draws
## # A tibble: 1,000,000 x 2
##       ID electoral_votes
##    <int>           <int>
##  1     1             333
##  2     2             352
##  3     3             338
##  4     4             437
##  5     5             346
##  6     6             171
##  7     7             318
##  8     8             210
##  9     9             261
## 10    10             298
## # … with 999,990 more rows

A distribution and a sample of draws from that distribution are different things. But, if you squint, they are sort of the same thing, at least for our purposes. For example, if you want to know the mean of the distribution, then the mean of the draws will be a fairly good estimate, especially if the number of draws is large enough.

Recall from Chapter 2 how we can draw randomly from specified probability distributions:

rnorm(10)
##  [1]  0.030  0.732  0.091  0.263 -0.100 -1.533 -0.194  0.604 -1.913  0.345
runif(10)
##  [1] 0.128 0.618 0.975 0.543 0.374 0.920 0.709 0.210 0.048 0.936

The elements of these vectors are all “draws” from the specified probability distributions. In most applied situations, our tools will produce draws rather than summary objects. Fortunately, a vector of draws is very easy to work with. Start with summary statistics:

key_stats <- draws %>% 
  summarize(mn = mean(electoral_votes),
            md = median(electoral_votes),
            sd = sd(electoral_votes),
            mad = mad(electoral_votes))

key_stats
## # A tibble: 1 x 4
##      mn    md    sd   mad
##   <dbl> <dbl> <dbl> <dbl>
## 1  325.   326  86.9  101.

Calculate a 95% interval directly:

quantile(draws$electoral_votes, probs = c(0.025, 0.975))
## 2.5%  98% 
##  172  483

Approximate the 95% interval in two ways:

c(key_stats$mn - 2 * key_stats$sd, 
  key_stats$mn + 2 * key_stats$sd)
## [1] 152 499
c(key_stats$md - 2 * key_stats$mad, 
  key_stats$md + 2 * key_stats$mad)
## [1] 124 528

In this case, using the mean and standard deviation produces a 95% interval which is closer to the true interval. In other cases, the median and scaled median absolute deviation will do better. Either approximation is generally “good enough” for most work. But, if you need to know the exact 95% interval, you must use quantile().

5.1.6 Unnormalized distributions

Remember that probability distributions are mathematical objects that cover a set of outcomes, where each outcome in the domain is mapped to a probability value between 0 and 1 inclusive and the sum of all mappings is 1. Sometimes, you may see distributions similar to probability distributions, only the y-axis displays raw counts instead of proportions. Unnormalized distributions are not probability distributions, but it is easy to convert between the two. You simply divide all the outcome counts on the y-axis by the sum of all outcome counts to “normalize” the unnormalized distribution. Unnormalized distributions are often an intermediary step; it is sometimes handy to work with counts until the very end.

For instance, we can generate the following unnormalized distribution for the sum of rolling two dice. (This uses the same code as above, but without the normalization step.)

Notice that the shape of the distribution is the same as the empirical probability distribution we generated earlier, except that the y-axis is labeled differently.

The two plots — unnormalized and normalized — have the exact same shape. In many ways, they are the same object. Yet normalization is required if we want to work with a probability distribution.

5.1.7 Joint distributions

Recall that \(\text{Prob}(\text{coin})\) is the probability distribution for the result of a coin toss. It includes two parts, the probability of heads (\(\rho_h\)) and the probability of tails (\(\rho_t\)). This is a univariate distribution because there is only one outcome, which can be heads or tails. If there is more than one outcome, then we have a joint distribution.

Joint distributions are also mathematical objects that cover a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 and the sum of all chances must equal 1. The key to a joint distribution is it measures the chance that both events A and B will occur. The notation is \(\text{Prob}(A, B)\).

Let’s say that you are rolling two six-sided dice simultaneously. Die 1 is weighted so that there is a 50% chance of rolling a 6 and a 10% chance of each of the other values. Die 2 is weighted so there is a 50% chance of rolling a 5 and a 10% chance of rolling each of the other values. Let’s roll both dice 1,000 times. In previous examples involving two dice, we cared about the sum of results and not the outcomes of the first versus the second die of each simulation. With a joint distributions, the order matters; so instead of 11 possible outcomes on the x-axis of our distribution plot (ranging from 2 to 12), we have 36. Furthermore, a 2D probability distribution is not sufficient to represent all of the variables involved, so the joint distribution for this example is displayed using a 3D plot.

5.2 Tree diagrams

5.2.1 Independence

So far, you have learned how to explore \(\text{Prob}(A)\), which is the fancy, statistical way of saying the probability distribution for event A. Keep in mind the distinction between an individual outcome from the set \(A\), like the probability of heads (which is \(\rho_h\)) and the entire distribution \(\text{Prob}(\text{coin})\), which includes the probability of each possible outcomes. \(\text{Prob}(A)\) is like \(\text{Prob}(\text{coin})\) or \(\text{Prob}(\text{sum of two dice})\).

What if you flipped two coins in a row, one after the other? You know that the probability of getting heads on the first coin is 1/2. But what are the odds of getting heads two times in a row? Let’s take a look at this tree diagram. We read this diagram from left to right. On the left, the probability of getting heads is 0.5 on the first toss. Then, the tree branches out.

  • If we got heads the first time, then we go up the top branch. The probability of getting heads again is 0.5.
  • If we got tails the first time, then we go down the bottom branch. The probability of getting heads is 0.5.

Notice how regardless of what we get the first time we flip the coin, the probability of getting heads is 0.5 throughout. Coin flips, in this scenario, are independent. The result of one coin flip does not impact the next coin flip.

5.2.2 Conditional probability

Imagine that 60% of people in a community have a disease. A doctor develops a test to determine if a random person has the disease. However, this test isn’t 100% accurate. There is an 80% probability of correctly returning positive if the person has the disease and 90% probability of correctly returning negative if the person does not have the disease.

The probability of a random person having the disease is 0.6. Since each person either has the disease or doesn’t (those are the only two possibilities), the probability that a person does not have the disease is \(1 - 0.6 = 0.4\).

Now the tree branches out.

  • If the random person has the disease, then we go up the top branch. The probability of an infected person testing positive is 0.8 because the test is 80% sure of correctly returning positive when the person has the disease.
  • By the same logic, if the random person does not have the disease, we go down the bottom branch. The probability of the person incorrectly testing positive is 0.1.

We decide to go down the top branch if our random person has the disease. We go down the bottom branch if they do not. This is called conditional probability. The probability of testing positive is dependent on whether the person has the disease.

How would you express this in statistical notation? \(\text{Prob}(A|B)\) is the same thing as the probability of A given B. \(\text{Prob}(A|B)\) essentially means the probability of A if we know for sure the value of B. Note that \(\text{Prob}(A|B)\) is not the same thing as \(\text{Prob}(B|A)\).

5.3 Two models

The simplest possible setting for inference involves two models — meaning two possible states of the world — and two outcomes from an experiment. Imagine that there is a disease — Probophobia, an irrational fear of probability — which you either have or don’t have. We don’t know if you have the diseases, but we do assume that there are only two possibilities. We also know that 1% of the people in the population suffer from Probophobia. Woe is them!

We also have a test which is 95% accurate. This is our experiment, one with only two possible outcomes: positive and negative.

Question: If you test positive, what is the probability that you have Probophobia?

More generally, we are estimating a conditional probability. Conditional on the outcome of a postive test, what is the probability that you have Probophobia? Mathematically, we want:

\[ \text{Prob}(\text{Disease | Test = Postive} ) \]

To answer this question, we need to use the tools of joint and conditional probability that we have learned earlier in this Chapter. We begin by building, by hand, the joint distribution of the possible models (you have the Probophobia or you do not) and of the possible outcomes (you test positive or negative). Building the joint distribution involves assuming that each model is true and then creating the distribution of outcomes which might occur if that assumption is true.

For example, assume you have Probophobia. There is then a 95% chance that you test positive and a 5% chance you test negative. Similarly, if we assume that the second model is true — that you don’t have Probophobia — then there is 5% chance you test positive and a 95% you chance negative. Of course, for you (or any individual) we do not know for sure what is happening. We do not know if you have the disease. We do not know what your test will show. But we can use these relationships to construct the joint distribution.

# It is often handy to bring out a key parameter rather than searching for it in
# the code every time you want to change it. When testing this code, we would
# set `sims` to a number like 3, just to make sure everything is working.

sims <- 100000

# Pipes generally start with tibbles, so we start with a tibble which just
# includes an ID variable. We don't really use ID. It is just handy for getting
# organized. We call this object `jd_disease`, where the `jd` standds for
# *j*oint *d*istribution.

jd_disease <- tibble(ID = 1:sims) %>% 
  
  # For each ID, we (randomly!) determine whether or not the person has
  # Probophobia. In the population, the rate is only 1%, thus explaining the
  # `prob` value of 0.01. Note that the call to `rbinom()` does not use the
  # value of the `ID` variable. 
  
  mutate(have_disease = map_int(ID, ~ rbinom(n = 1, 
                                             size = 1, 
                                             prob = 0.01))) %>% 
  
  # For each person, we (randomly!) decide whether or not their test comes back
  # positive. The test is 95% accurate, explaining the use of 0.95 in
  # `rbinom()`. Then the hackery begins! rbinom() returns 1 or 0, which
  # `ifelse()` treats as TRUE or FALSE. For TRUE, `test_positive` is assigned
  # the same value as `have_disease`, which is what we want. The test was
  # accurate. This is the single `.`. But 5% of the time, rbinom() will return a
  # 0, meaning FALSE, meaning that we assign the opposite value --- generated by
  # `! .` --- of `have_disease`. The use of `map_int()` ensures that we get a 0
  # or 1 as the return value.
  
  mutate(test_positive = map_int(have_disease,
                                 ~ ifelse(rbinom(n = 1, 
                                          size = 1, 
                                          prob = 0.95),
                                          .,
                                          ! .)))

jd_disease
## # A tibble: 100,000 x 3
##       ID have_disease test_positive
##    <int>        <int>         <int>
##  1     1            0             0
##  2     2            0             0
##  3     3            0             0
##  4     4            0             0
##  5     5            0             0
##  6     6            0             0
##  7     7            0             0
##  8     8            0             1
##  9     9            0             0
## 10    10            1             1
## # … with 99,990 more rows

Most values for have_diseases are zero since 99% of the population does not suffer from Probophobia. Most of those will have a negative test result, meaning test_positive = 0. However, because the test is not perfect, some of those, like ID = 8, will have a “false positive” result. Other individuals, like ID = 10, will have the diseases, and 95% of those will have a “true positive” result, as 10 does.

Plot the joint distribution:

jd_disease %>% 
  ggplot(aes(x = as.factor(test_positive), 
             y = as.factor(have_disease))) +
  geom_point() +
  geom_jitter(alpha = .2) +
  labs(title = "Unnormalized Distribution of Test Results and Disease Status",
       subtitle = "Rare diseases have many false positives",
       x = "Test Result",
       y = "Disease Status") +
  scale_x_discrete(breaks = c(0, 1), 
                   labels = c("Negative", "Positive")) +
  scale_y_discrete(breaks = c(0, 1), 
                   labels = c("Negative", "Positive")) +
  theme_classic()

Below is a joint distribution displayed in 3D. Instead of using the “jitter” feature in R to unstack the dots, we are using a 3D plot to visualize the number of dots in each box. The number of people who correctly test negative is far greater than of the other categories. There are far fewer false positives and true positives – and only a handful of false negatives. This is why we can barely see the 3D bar coming from those combinations.

This Section is called “Two Models” because, for each person, there are two possible states of the world: have the disease or not have the disease. By assumption, there are no other outcomes. We call these two possible states of the world “models,” even though they are very simple models.

In addition to the two models, we have two possible results of our experiment on a given person: test positive or test negative. Again, this is an assumption. We do not allow for any other outcome. In coming sections, we will look at more complex situations where we consider more than two models and more than two possible results of the experiment. In the meantime, we have built the unnormalized joint distribution for models and results. This is a key point! Look back earlier in this Chapter for discussions about both unnormalized distributions and joint distributions.

We want to analyze these plots by looking at different slices. For instance, let’s say that you have tested positive for the disease. Since the test is not always accurate, you cannot be 100% certain that you have it. We isolate the slice where the test result equals 1 (meaning positive).

jd_disease %>% 
  filter(test_positive == 1)
## # A tibble: 5,946 x 3
##       ID have_disease test_positive
##    <int>        <int>         <int>
##  1     8            0             1
##  2    10            1             1
##  3    19            0             1
##  4    29            0             1
##  5    47            0             1
##  6    51            1             1
##  7    57            0             1
##  8    77            0             1
##  9    78            1             1
## 10    82            0             1
## # … with 5,936 more rows

Most people who test positive do not have the disease! This is a common result for rare diseases. We can easily create an unnormalized conditional distribution with:

jd_disease %>% 
  filter(test_positive == 1) %>% 
  ggplot(aes(have_disease)) +
    geom_bar() +
    labs(title = "Disease Status Given Positive Test",
         subtitle = "Rare diseases generates mostly false positives",
         x = "Disease Status",
         y = "Count") +
    scale_x_continuous(breaks = c(0, 1),
                       labels = c("Healthy", "Infected")) +
  theme_classic()

filter() transforms a joint distribution into a conditional distribution.

Turn this unnormalized distribution into a posterior probability distribution:

jd_disease %>% 
  filter(test_positive == 1) %>% 
  ggplot(aes(have_disease)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 0.25, 
                   color = "white") +
    labs(title = "Posterior for Probophobia Conditional on Positive Test",
         x = "Probophobia Status",
         y = "Probability") +
    scale_x_continuous(breaks = c(0, 1),
                       labels = c("Healthy", "Infected")) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()  

Even with a positive test, you are more than 80% likely to be healthy.

This Stat 110 Animations video does a really good job of explaining similar concepts.

5.4 Three models

Imagine that your friend gives you a bag with two marbles. There could either be two white marbles, two black marbles, or one of each color. Thus, the bag could contain 0% white marbles, 50% white marbles, or 100% white marbles. The proportion, \(p\), of white marbles could be 0, 0.5, or 1.

Let’s say you take a marble out of the bag, record whether it’s black or white, then return it to the bag. You repeat this three times, observing the number of white marbles you see out of three trials. You could get three whites, two whites, one white, or zero whites as a result of this trial. Let’s make what we call a Bayes scatterplot out of this. We have three models (three different proportions of white marbles in the bag) and four possible experimental results. Let’s create 3,000 draws from this joint distribution:

# Create the joint distribution of the number of white marbles in the bag
# (in_bag) and the number of white marbles pulled out in the sample (in_sample),
# one-by-one. in_bag takes three possible values: 0, 1 and 2, corresponding to
# zero, one and two white marbles potentially in the bag.

sims <- 10000

# We also start off with a tibble. It just makes things easier

jd_marbles <- tibble(ID = 1:sims) %>% 
  
  # For each row, we (randomly!) determine the number of white marbles in the
  # bag. We do not know why the `as.integer()` hack is necessary. Shouldn't
  # `map_int()` automatically coerce the result of `sample()` into an integer?
  
  mutate(in_bag = map_int(ID, ~ as.integer(sample(c(0, 1, 2), 
                                                  size = 1)))) %>%
  
  # Depending on the number of white marbles in the bag, we randomly draw out 0,
  # 1, 2, or 3 white marbles in our experiment. We need `p = ./2` to transform
  # the number of white marbles into the probability of drawing out a white
  # marble in a single draw. That probability is either 0%, 50% or 100%.
  
  mutate(in_sample = map_int(in_bag, ~ rbinom(n = 1, 
                                              size = 3, 
                                              p = ./2))) 

jd_marbles
## # A tibble: 10,000 x 3
##       ID in_bag in_sample
##    <int>  <int>     <int>
##  1     1      0         0
##  2     2      2         3
##  3     3      0         0
##  4     4      2         3
##  5     5      0         0
##  6     6      0         0
##  7     7      0         0
##  8     8      0         0
##  9     9      2         3
## 10    10      0         0
## # … with 9,990 more rows

Plot the joint distribution:

# The distribution is unnormalized. All we see is the number of outcomes in each
# "bucket." Although it is never stated clearly, we are assuming that there is
# an equal likelihood of 0, 1 or 2 white marbles in the bag.

jd_marbles %>%
  ggplot(aes(x = in_sample, y = in_bag)) +
    geom_jitter(alpha = 0.5) +
    labs(title = "Black and White Marbles",
         subtitle = "More white marbles in bag mean more white marbles selected",
         x = "White Marbles Selected",
         y = "White Marbles in the Bag") +
    scale_y_continuous(breaks = c(0, 1, 2)) +
  theme_classic()

Here is the 3D visualization:

The y-axes of both the scatterplot and the 3D visualization are labeled “Number of White Marbles in the Bag.” Each value on the y-axis is a model, a belief about the world. For instance, when the model is 0, we have no white marbles in the bag, meaning that none of the marbles we pull out in the sample will be white.

Let’s say we draw out three white marbles. We isolate the slice where the result of the simulation involves three white marbles and zero black ones. Here is the unnormalized probability distribution.

# The key step is the filter. Creating a conditional distribution from a joint
# distribution is the same thing as filtering that joint distribution for a
# specific value. A conditional distribution is a "slice" of the joint
# distribution, and we take that slice with filter().

jd_marbles %>% 
  filter(in_sample == 3) %>% 
  ggplot(aes(in_bag)) +
    geom_histogram(binwidth = 0.5, color = "white") +
    labs(title = "Unnormalized Conditional Distribution",
         subtitle = "Number of white marbles in bag given that three were selected in the sample",
         x = "Number of White Marbles in the Bag",
         y = "Count") +
    coord_cartesian(xlim = c(0, 2)) +
    scale_x_continuous(breaks = c(0, 1, 2)) +
    theme_classic()

Next, let’s normalize the distribution.

jd_marbles %>% 
  filter(in_sample == 3) %>% 
  ggplot(aes(in_bag)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 0.5, 
                   color = "white") +
    labs(title = "Posterior Probability Distribution",
         subtitle = "Number of white marbles in bag given that three were selected in the sample",
         x = "Number of White Marbles in the Bag",
         y = "Probability") +
    coord_cartesian(xlim = c(0, 2)) +
    scale_x_continuous(breaks = c(0, 1, 2)) +
    scale_y_continuous(labels = 
                         scales::percent_format(accuracy = 1)) +
    theme_classic()

This plot makes sense because when all three marbles you draw out of the bag are white, there is a pretty good chance that there are no black marbles in the bag. But you can’t be certain! It is possible to draw three white even if the bag contains one white and one black. However, it is impossible that there are zero white marbles in the bag.

5.5 N models

Assume that there is a coin with \(\rho_h\). We guarantee that there are only 11 possible values of \(\rho_h\): \(0, 0.1, 0.2, ..., 0.9, 1\). In other words, there are 11 possible models, 11 things which might be true about the world. This is just like situations we have previously discussed, except that there are more models to consider.

We are going to run an experiment in which you flip the coin 20 times and record the number of heads. What does this result tell you about the value of \(\rho_h\)? Ultimately, we will want to calculate a posterior distribution of \(\rho_h\), which is written as p(\(\rho_h\)).

To start, it is useful to consider all the things which might happen if, for example, \(\rho_h = 0.4\). Fortunately, the R functions for simulating random variables makes this easy.

First, notice that many different things can happen! Even if we know, for certain, that \(\rho_h = 0.4\), many outcomes are possible. Life is remarkably random. Second, the most likely result of the experiment is 8 heads, as we would expect. Third, we have transformed the raw counts of how many times each total appeared into a probability distribution. Sometimes, however, it is convenient to just keep track of the raw counts. The shape of the figure is the same in both cases.

Either way, the figures show what would have happened if that model — that \(\rho_h = 0.4\) — were true.

We can do the same thing for all 11 possible models, calculating what would happen if each of them were true. This is somewhat counterfactual since only one of them can be true. Yet this assumption does allow us to create the joint distribution of models which might be true and of data which our experiment might generate. Let’s simplify this is p(models, data), although you should keep the precise meaning in mind.

Here is the 3D version of the same plot.

In both of these diagrams, we see 11 models and 21 outcomes. We don’t really care about the p(\(models\), \(data\)), the joint distribution of the models-which-might-be-true and the data-which-our-experiment-might-generate. Instead, we want to estimate \(p\), the unknown parameter which determines the probability that this coin will come up heads when tossed. The joint distribution alone can’t tell us that. We created the joint distribution before we had even conducted the experiment. It is our creation, a tool which we use to make inferences. Instead, we want the conditional distribution, p(\(models\) | \(data = 8\)). We have the results of the experiment. What do those results tell us about the probability distribution of \(p\)?

To answer this question, we simply take a vertical slice from the joint distribution at the point of the x-axis corresponding to the results of the experiment.

This animation shows what we want to do with joint distributions. We take a slice (the red one), isolate it, rotate it to look at the conditional distribution, normalize it (change the values along the current z-axis from counts to probabilities), then observe the resulting posterior.

This is the only part of the joint distribution that we care about. We aren’t interested in what the object looks like where, for example, the number of heads is 11. That portion is irrelevant because we observed 8 heads, not 11. By using the filter function on the simulation tibble we created, we can conclude that there are a total of 465 times in our simulation in which 8 heads were observed.

As we would expect, most of the time when 8 coin tosses came up heads, the value of \(p\) was 0.4. But, on numerous occasions, it was not. It is quite common for a value of \(p\) like 0.3 or 0.5 to generate 8 heads. Consider:

Yet this is a distribution of raw counts. It is an unnormalized density. To turn it into a proper probability density (i.e., one in which the sum of the probabilities across possible outcomes sums to one) we just divide everything by the total number of observations.

The most likely value of \(\rho_h\) is still 0.4, as before. But, it is much more likely that \(p\) is either 0.3 or 0.5. And there is about an 8% chance that \(\rho_h \ge 0.6\).

You might be wondering: what is the use of a model? Well, let’s say we toss the coin 20 times and get 8 heads again. Given this result, what is the probability that future samples of 20 flips will result in 10 or more heads?

There are three main ways you could go about solving this problem with simulations.

The first wrong way to do this is assuming that \(\rho_h\) is certain because we observed 8 heads after 20 tosses. We would conclude that 8/20 gives us 0.4. The big problem with this is that you are ignoring your uncertainty when estimating \(\rho_h\). This would lead us to the following code.

sims <- 10000000

odds <- tibble(sim_ID = 1:sims) %>%
  mutate(heads = map_int(sim_ID, ~ rbinom(n = 1, size = 20, p = .4))) %>% 
  mutate(result = ifelse(heads >= 10, TRUE, FALSE)) %>% 
  summarize(success = sum(result)/sims)

odds
## # A tibble: 1 x 1
##   success
##     <dbl>
## 1   0.245

The second method involves sampling the whole posterior distribution vector we previously created. This would lead to the following correct code.

p_draws <- tibble(p = rep(seq(0, 1, 0.1), 1000)) %>%
  mutate(heads = map_int(p, ~ rbinom(n = 1, size = 20, p = .))) %>%
  filter(heads == 8)
  
odds <- tibble(p = sample(p_draws$p, size = sims, replace = TRUE)) %>%
  mutate(heads = map_int(p, ~ rbinom(n = 1, size = 20, p = .))) %>% 
  mutate(result = ifelse(heads >= 10, TRUE, FALSE)) %>% 
  summarize(success = sum(result)/sims)

odds
## # A tibble: 1 x 1
##   success
##     <dbl>
## 1   0.328

Third way is to sample from the actual distribution, which is a small dataset with just a few rows, but it includes both \(\rho_h\) and the probability of each \(\rho_h\). This also gives the correct answer.

p_posterior <- jd_coin %>% 
  filter(heads == 8) %>% 
  group_by(p) %>% 
  summarize(total = n(), .groups = "drop") %>%
  mutate(probs = total/sum(total))

odds <- tibble(p = sample(p_posterior$p, 
                          size = sims, 
                          prob = p_posterior$probs, 
                          replace = TRUE)) %>%
  mutate(heads = map_int(p, ~ rbinom(n = 1, 
                                     size = 20, 
                                     p = .))) %>% 
  mutate(result = ifelse(heads >= 10, 
                         TRUE, 
                         FALSE)) %>% 
  summarize(success = sum(result)/sims)

odds
## # A tibble: 1 x 1
##   success
##     <dbl>
## 1   0.320

As you may have noticed, if you calculated the value using the first method, you would believe that getting 10 or more heads is less likely than it really is. If you were to run a casino based on these assumptions, you will lose all your money. It is very important to be careful about the assumptions you are making. We tossed a coin 20 times and got 8 heads. However, you would be wrong to assume that \(\rho_h\) = 0.4 just based on this result.

5.6 Cardinal Virtues

The four Cardinal Virtues are Wisdom, Justice, Courage, and Temperance. Because data science is, ultimately, a moral act, we use these virtues to guide our work. Wisdom highlights the importance of exploring your data. Given the questions you seek to answer, is the data valid? Is it representative? Is it ethical to proceed? Justice focuses on the Preceptor Table and the mathematical structure of the model. The virtue of Courage allows us to go from words and math to code, the hardest step in the data science process. Temperance guides us in the use of the model we have created to answer the questions we began with. We should be modest in the claims we make, and wary of naive attempts to “test” them.

5.6.1 Wisdom

Wisdom.

FIGURE 5.4: Wisdom.

Wisdom encompasses four topics: data, validity, the population and ethics.

First, we look at the data and perform an exploratory data analysis, an EDA. You can never look at your data too much.

Second, confirm that the data we have accurately capturing the concepts we care about. Is the data valid, given the problem we are trying to solve? Gelman, Hill, and Vehtari (2020) write:

Most important is that the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to the cases to which it will be applied.

For example, with regard to the outcome variable, a model of incomes will not necessarily tell you about patterns of total assets. A model of test scores will not necessarily tell you about child intelligence or cognitive development. …

A sample that is representative of all mothers and children may not be the most appropriate for making inferences about mothers and children who participate in the Temporary Assistance for Needy Families program. … Similarly, results regarding diet and exercise obtained from a study performed on patients at risk for heart disease may not be generally applicable to generally healthy individuals.

We want there to be a connection between the problem we face and the data we have. But desire is a poor replacement for substance. If the data we have is not relevant, then data science is useless.

Third, a key concept is the “population,” which is largely equivalent to the rows of the ideal Preceptor Table, as discussed in Chapter 4. Gelman, Hill, and Vehtari (2020) write:

A regression model is fit to data and is used to make inferences about a larger population, hence the implicit assumption in interpreting regression coefficients is that the sample is representative of the population.

To be more precise, the key assumption is that the data are representative of the distribution of the outcome \(y\) given the predictors \(x_1\), \(x_2\), …, that are included in the model. For example, in a regression of earnings on height and sex, it would be acceptable for women and tall people to be overrepresented in the sample, compared to the general population, but problems would arise if the sample includes too many rich people. Selection on x does not interfere with inferences from the regression model, but selection on y does. This is one motivation to include more predictors in our regressions, to allow the assumption of representativeness, conditional on X, to be more reasonable.

Representativeness is a concern even with data that would not conventionally be considered as a sample. For example, the forecasting model in Section 7.1 contains data from 16 consecutive elections, and these are not a sample from anything, but one purpose of the model is to predict future elections. Using the regression fit to past data to predict the next election is mathematically equivalent to considering the observed data and the new outcome as a random sample from a hypothetical superpopulation. Or, to be more precise, it is like treating the errors as a random sample from the normal error distribution. For another example, if a regression model is fit to data from the 50 states, we are not interested in making predictions for a hypothetical 51st state, but you may well be interested in the hypothetical outcome in the 50 states in a future year. As long as some generalization is involved, ideas of statistical sampling arise, and we need to think about representativeness of the sample to the implicit or explicit population about which inferences will be drawn. This is related to the idea of generative modeling, as discussed in Section 4.1.

Validity is about the columns in our tibble. Representativeness is about the rows.

Fourth, even if we can make a model, we should always ask ourselves: Should we? Ethics matter.

5.6.2 Justice

Justice.

FIGURE 5.5: Justice.

There are three key aspects of Justice: the Preceptor Table, predictive versus causal models, and mathematical formula.

5.6.2.1 Preceptor Table

Precision matters. What do you know and what do you want to know? Creating a Preceptor Table forces you to be clear about what you are trying to accomplish and the problems you will face along the way.

A Preceptor Table is a table with rows and columns for all the data that we have and would (reasonably) like to have. If we had a Preceptor Table with no questions marks — which we describe below as the ideal Preceptor Table — we could calculate the number we want without the need for any inference.

First, the infinite Preceptor Table is well described in Chapter 4. We never use these, but it useful to understand them, and the assumptions we must make to work with something simpler.

Second, the ideal Preceptor Table is the Preceptor Table with no question marks, and with a reasonable number of rows and columns. Conceptually, this is the heart of the analysis.

Third, the actual Preceptor Table is the Preceptor Table with all the annoying question marks which the real world saddles us with.

The central problem in inference is to fill in the question marks on the actual Preceptor Table.

Note that the concept of a Preceptor Table is more subtle than it at first appears. For example, consider a problem in which we are using party to predict income with the trains data. Naively, it might look like we don’t have any question marks. For all 115 rows in the data, we have party and income. But the Preceptor Table we really need has more than 115 rows because we are trying to draw inferences about people not in the sample of 115. What if a new person shows up, tells us their party, and asks us to guess their income? The ideal Preceptor Table would include their row, with the data for party present but the data for income a question mark.

5.6.2.2 Predictive versus causal models

Are we are modeling (just) for prediction or are we (also) modeling for causation? Predictive models care nothing about causation. Causal models are often also concerned with prediction, if only as a means of measuring the quality of the model.

Every model is predictive, in the sense that, if we give you new data — and it is drawn from a stable distribution — then you can create a predictive forecast. But only a subset of those models are causal, meaning that, for a given individual, you can change the value of one input and figure out what the new output would be and then, from that, calculate the causal effect.

With prediction, all we care about is forecasting Y given X on some as-yet-unseen data — implying that the Preceptor Table has rows in which we know all the X but not the Y. But there is no notion of “manipulation” in such models. We don’t pretend that, for Joe, we could turn variable X from a value of 5 to a value of 30 by just turning some knob and, by doing so, cause Joe’s value of Y to change from 17 to 23. We can compare two people (or two groups of people), one with X equal to 5 and one with X equal to 30, and see how they differ in Y. The basic assumption of predictive models is that there is only one possible Y for Joe. There are not, by assumption, two possible values for Y, one if X equal 5 and another if X equals 30. The Preceptor Table has a single column under Y.

With causal inference, however, we can consider the case of Joe with \(X = 5\) and Joe with \(X = 30\). The same mathematical model can be used. And both models can be used for prediction, for estimating what the value of Y will be for a yet-unseen observation with values X. But, in this case, instead of only a single column in the Preceptor Table for Y, we have at least two (and possibly many) such columns, one for each of the potential outcomes under consideration.

The difference between prediction models and causal models is that the former have one column for Y and the latter have more than one.

A related issue is the different types of Preceptor Tables. All of these are Preceptor Tables, but it is useful, I think, to have a clear description of the different types.

5.6.2.3 Mathematical Model

Justice requires (simple) math. Consider a model of coin-tossing:

\[ T_i \sim B(p_H, n = 20) \] The total number \(T\) of Heads in experiment \(i\) with 20 flips of a single coin, \(T_i\), is distributed as a binomial with \(n = 20\) and an unknown probability \(\rho_h\) of the coin coming up Heads.

Note:

  • This is a cheat and a simplification! We are Bayesians but we have not specified the full Bayesian machinery. We really need priors on the unknown parameter \(\rho_h\) as well. But that is too complex for an introductory class, so we wave our hands, accept the default sensible parameters built into the functions we use and point readers to more advanced books, like Gelman, Hill, and Vehtari (2020).

  • Defining \(\rho_h\) as the “the probability that the coin comes up Heads” is a bit of a fudge. If you calculate that by hand and then compare it to what our tools produce, they won’t be the same. Instead, the calculated value will be closer to zero. Why? \(\rho_h\) is really the “long-run percentage of the time the coin comes up Heads.” It is not just the percentage from this experiment.

  • In this simple case, we are fortunate that the parameter \(\rho_h\) has such a (mostly!) simple analog to a real world quantity. Much of the time, parameters are not so easy to interpret. In a more complex model, especially with one with interaction terms, we focus less on parameters and more on actual predictions.

5.6.3 Courage

Courage.

FIGURE 5.6: Courage.

The three languages of data science are words, math and code, and the most important of these is code. We need to explain the structure of our model using all three languages, but we need Courage to implement the model in code.

Code allows us to “fit” a model by estimating the values of the unknown parameters, like \(\rho_h\). Sadly, we can never know the true values of these parameters. But, like all good data scientists, we can express our uncertain knowledge in the form of posterior probability distributions. With those distributions, we can compare the actual values of the outcome variable with the “fitted” or “predicted” results of the model. We can examine the “residuals,” the difference between the fitted and actual values.

Every outcome is the sum of two parts: the model and what is not in the model:

\[outcome = model + what\ is\ not\ in\ the\ model\]

It doesn’t matter what the outcome is. It could be the result of a coin flip, the weight of a person, the GDP of a country. Whatever outcome we are considering is always made up of two parts. The first is the model we have created. The second is all the stuff — all the blooming and buzzing complexity of the real world — which is not a part of the model.

Some of our uncertainty is driven by our ignorance about \(\rho_h\).

A parameter is something which does not exist in the real world. (If it did, or could, then it would be data.) Instead, a parameter is a mental abstraction, a building block which we will use to to help us accomplish our true goal: To replace at least some of the questions marks in the Preceptor Table. Since parameters are mental abstractions, we will always be uncertain as to their value, however much data we might collect.

But some, often most, of the uncertainty comes from forces that are, by assumption, not in the model. For example, if the coin is fair, we expect \(T_i\) to equal 10. But, often, it will be different, even if we are correct and \(\rho_h\) equals exactly 0.5. Some randomness is intrinsic in this fallen world.

5.6.4 Temperance

Temperance.

FIGURE 5.7: Temperance.

There are few more important concepts in statistics and data science than the “Data Generating Mechanism.” Our data — the data that we collect and see — has been generated by the complexity and confusion of the world. God’s own mechanism has brought His data to us. Our job is to build a model of that process, to create, on the computer, a mechanism which generates fake data consistent with the data which we see. With that DGM, we can answer any question which we might have. In particular, with the DGM, we provide predictions of data we have not seen and estimates of the uncertainty associated with those predictions. Courage helped us to create the DGM. Temperance will guide us in its use.

Having created (and checked) a model, we now use the model to answer questions. Models are made for use, not for beauty. The world confronts us. Make decisions we must. Our decisions will be better ones if we use high quality models to help make them.

Sadly, our models are never as good as we would like them to be. First, the world is intrinsically uncertain.

Donald Rumsfeld.

FIGURE 5.8: Donald Rumsfeld.

There are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we do not know we do not know. – Donald Rumsfeld

What we really care about is data we haven’t seen yet, mostly data from tomorrow. But what if the world changes, as it always does? If it doesn’t change much, maybe we are OK. If it changes a lot, then what good will our model be? In general, the world changes some. That means that are forecasts are more uncertain that a naive use of our model might suggest.

Three Card Monte.

FIGURE 5.9: Three Card Monte.

What does this mean? Well imagine a crowd playing Three Card Monte in the streets of New York. The guy running the game runs a demo and shows you all the cards to make you confident. They earn money by making you overconfident and persuading you to bet. Your odds may seem good during the demo round, but that doesn’t actually say anything about what will likely happen when the real, high stakes game begins. The person running the game does many simulations, making the “victim” forget that they cannot actually make any conclusions about the odds of winning. There are some variables that we simply do not know even if we put a lot of effort into making posterior probability distributions. People can be using slight of hand, for instance.

We need patience in order to study and understand the unknown unknowns in our data. Patience is also important when we analyze the “realism” of our models. When we created the mathematical probability distribution for presidential elections, for instance, we assumed that the Democratic candidate would have a 50% chance of winning each vote in the electoral college. By comparing the mathematical model to our empirical cases, however, we recognize that the mathematical model is unlikely to be true. The mathematical model suggested that getting fewer than 100 votes is next to impossible, but many past Democratic candidates in the empirical distribution received less than 100 electoral votes.

In Temperance, the key distinction is between the true posterior distribution — what we will call “Preceptor’s Posterior” — and the estimated posterior distribution. Recall our discussion from Section 2.8. Imagine that every assumption we made in Wisdom and Justice were correct, that we correctly understand every aspect of how the world works. We still would not know the unknown value we are trying to estimate — recall the Fundamental Problem of Causal Inference — but the posterior we created would be perfect. That is Preceptor’s Posterior. Sadly, even if our estimated posterior is, very close to Preceptor’s Posterior, we can never be sure of that fact, because we can never know the truth, never be certain that all the assumptions we made are correct.

Even worse, we must always worry that our estimated posterior, despite all the work we put into creating it, is far from the truth. We, therefore, must be cautious in our use of that posterior, humble in our claims about its accuracy. Using our posterior, despite its fails, is better than not using it. Yet it is, as best, a distorted map of reality, a glass through which we must look darkly. Use your posterior with humility.

5.7 Summary

Validity is about the columns in our tibble. Representativeness is about the rows.

Throughout this chapter, we spent time going through examples of conditional distributions. However, it’s worth noting that all probability distributions are conditional on something. Even in the most simple examples, when we were flipping a coin multiple times, we were assuming that the probability of getting heads versus tails did not change between tosses.

We also discussed the difference between empirical, mathematical, and posterior probability distributions. Even though we developed these heuristics to better understand distributions, every time we make a claim about the world, it is based on our beliefs - what we think about the world. We could be wrong. Our beliefs can differ. Two reasonable people can have conflicting beliefs about the fairness of a die.

It is useful to understand the three types of distributions and the concept of conditional distributions, but almost every probability distribution is conditional and posterior. We can leave out both words in future discussions, as we generally will in this book. They are implicit.

If you are keen to learn more about probability, here is a video featuring Professor Gary King. This is a great way to review some of the concepts we covered in this chapter, albeit at a higher level of mathematics.